notebooks/tutorials/bigquery/Getting started with BigQuery ML Feature Engineering.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dsDV5jNzgbGl" }, "source": [ "##### Copyright 2019 Google LLC" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "both", "colab": {}, "colab_type": "code", "id": "Y4UiAxk2gPnp" }, "outputs": [], "source": [ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mHF9VCProKJN" }, "source": [ "# Getting started: Training and Prediction with BigQuery ML\n", "\n", "\n", "<img src=\"https://wwwstatic-c.lookercdn.com/partner_network/pages/google/bigquery_v0002.png\" width=\"120\" align=\"middle\"/>" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hZzRVxNtH-zG" }, "source": [ "## Overview\n", "\n", "This tutorial shows how to train and evaluate a Binary Classification model using BigQuery ML (BQML), we also will use this model to serve predictions.\n", "\n", "The first two parts of the tutorial walk through extracting the data, preparing it before you train your model.\n", "\n", "The last part of the tutorial digs into the training code used for this model, with a particular focus on requirements for making it compatible with BigQuery ML." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iN69d4D9Flrh" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the [United States Census Income\n", "Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) provided by the\n", "[UC Irvine Machine Learning\n", "Repository](https://archive.ics.uci.edu/ml/index.php). This dataset contains\n", "information about people from a 1994 Census database, including age, education,\n", "marital status, occupation, and whether they make more than $50,000 a year.\n", "\n", "Dataset now exists in [BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data/):\n", "\n", "```\n", "`bigquery-public-data.ml_datasets.census_adult_income`\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Su2qu-4CW-YH" }, "source": [ "### Goal\n", "\n", "The goal is to train a Binary Classification model using BigQuery ML that predicts\n", "whether a person makes more than $50,000 a year (target label) based on other\n", "Census information about the person (features).\n", "\n", "This tutorial focuses more on using this model with BigQuery ML than on\n", "the design of the model itself. However, it's always important to think about\n", "potential problems and unintended consequences when building machine learning\n", "systems. See the [Machine Learning Crash Course exercise about\n", "fairness](https://developers.google.com/machine-learning/crash-course/fairness/programming-exercise)\n", "to learn about sources of bias in the Census dataset, as well as machine\n", "learning fairness more generally." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rgLXkyHEvTVD" }, "source": [ "## Before you begin\n", "\n", "You must do several things before you can train a model in BigQuery ML:\n", "\n", "* Set up a Google Cloud Platform (GCP) project with billing and the necessary APIs enabled. 
Take a look here for a [Free Trial](https://cloud.google.com/free)\n", "* Set up your development environment. (Skip this step if you're using\n", "Colaboratory.)\n", "* Authenticate your GCP account in this notebook.\n", "* Create a Google Cloud Storage bucket to store your training data. \n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NOAbQC43NJM-" }, "source": [ "### Costs\n", "This tutorial uses billable components of Cloud Platform, including:\n", "\n", "- Google BigQuery\n", "- BigQuery ML\n", "\n", "Learn about [Google BigQuery\n", "pricing](https://cloud.google.com/bigquery/docs/controlling-costs) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/bigquery/docs/estimate-costs)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SZFttqMsQT9_" }, "source": [ "### Set up your development environment\n", "\n", "**If you are using Colab**, your environment already meets\n", "all the requirements to run this notebook. You can skip this step.\n", "\n", "**Otherwise**, make sure your environment meets this notebook's requirements. You need the following:\n", "\n", "* The Google Cloud SDK\n", "* Python 3\n", "* virtualenv\n", "* Jupyter notebook running in a virtual environment with Python 3\n", "\n", "The Google Cloud guide to [Setting up a Python development\n", "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", "installation guide](https://jupyter.org/install) provide detailed instructions\n", "for meeting these requirements. The following steps provide a condensed set of\n", "instructions:\n", "\n", "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", "\n", "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", "\n", "3. 
[Install\n", " virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", " and create a virtual environment that uses Python 3.\n", "\n", "4. Activate that environment. Run `pip install jupyter` in a shell to install\n", " Jupyter.\n", "\n", "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", "\n", "6. Open this notebook in the Jupyter Notebook Dashboard." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4GYu05KjLYNG" }, "source": [ "## Set up your GCP project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager)\n", "\n", "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", "3. BigQuery is automatically enabled in new projects. [To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.](https://console.cloud.google.com/flows/enableapi?apiid=bigquery)\n", "\n", "4. Enter your project ID in the cell below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "colab": {}, "colab_type": "code", "id": "4qxwBA4RM9Lu" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-gcp-project]\" # @param {type:\"string\"}\n", "! gcloud config set project $PROJECT_ID" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fZQUrHdXNJnk" }, "source": [ "### Authenticate your GCP account\n", "\n", "**If you are using Colab**, run the cell below and follow the instructions\n", "when prompted to authenticate your account via OAuth.\n", "\n", "**Otherwise**, follow these steps:\n", "\n", "1. In the GCP Console, go to the [**Create service account key**\n", " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", "\n", "2. From the **Service account** drop-down list, select **New service account**.\n", "\n", "3. 
In the **Service account name** field, enter a name.\n", "\n", "4. From the **Role** drop-down list, select\n", " **BigQuery > BigQuery Admin** and\n", " **Storage > Storage Object Admin**.\n", "\n", "5. Click **Create**. A JSON file that contains your key downloads to your\n", "computer.\n", "\n", "6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "W9i6oektpgld" }, "outputs": [], "source": [ "import sys\n", "\n", "# If you are running this notebook in Colaboratory, run this cell and follow the\n", "# instructions to authenticate your GCP account. This provides access to your\n", "# Google Cloud Storage bucket and lets us submit training jobs and prediction\n", "# requests.\n", "\n", "if \"google.colab\" in sys.modules:\n", " from google.colab import auth as google_auth\n", "\n", " google_auth.authenticate_user()\n", "\n", "# If you are running this notebook locally, replace the string below with the\n", "# path to your service account key and run this cell to authenticate your GCP\n", "# account.\n", "else:\n", " %env GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GKP7F6-EDb5_" }, "source": [ "## ML Workflow using a BigQuery ML model\n", "\n", "In this section we will build a BigQuery ML model from scratch.\n", "We will perform the following steps:\n", "- Data preparation\n", "- Model creation\n", "- Model training\n", "- Model evaluation\n", "- Model serving\n", "\n", "We will create the model in BigQuery ML and use it to serve prediction requests."
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-VtUN0L5x4ql" }, "source": [ "### Import libraries\n", "Import supporting modules:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "RcxfR3GfscsA" }, "outputs": [], "source": [ "# Data processing\n", "import pandas as pd\n", "\n", "# Visualizations\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# BigQuery API\n", "from google.cloud import bigquery\n", "\n", "# Show software versions\n", "print(__import__(\"sys\").version)\n", "print(pd.__version__)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xWZQbZQmx26U" }, "source": [ "### Define Constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ox2nu_wGaTPb" }, "outputs": [], "source": [ "# These are the features in the dataset.\n", "# Dataset information: https://archive.ics.uci.edu/ml/datasets/census+income\n", "# which exists now in: `bigquery-public-data.ml_datasets.census_adult_income`\n", "_COLUMNS = [\n", " \"age\",\n", " \"workclass\",\n", " \"functional_weight\",\n", " \"education\",\n", " \"education_num\",\n", " \"marital_status\",\n", " \"occupation\",\n", " \"relationship\",\n", " \"race\",\n", " \"sex\",\n", " \"capital_gain\",\n", " \"capital_loss\",\n", " \"hours_per_week\",\n", " \"native_country\",\n", " \"income_bracket\",\n", "]\n", "\n", "# This is the label (target) we want to predict.\n", "_LABEL_COLUMN = \"income_bracket\"" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Hhsa1-6qVD0n" }, "source": [ "#### Unused Features and Features that are Biased\n", "\n", "These columns can be skipped when using SQL when creating the training table." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "A_mQNFOiZxWI" }, "outputs": [], "source": [ "\"\"\"\n", "These are columns we will not use as features for training. There are many\n", "reasons not to use certain attributes of data for training. Perhaps their\n", "values are noisy or inconsistent, or perhaps they encode bias that we do not\n", "want our model to learn. For a deep dive into the features of this Census\n", "dataset and the challenges they pose, see the Introduction to ML Fairness\n", "notebook: \n", "(\"https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/\"\n", "\"exercises/intro_to_fairness.ipynb\")\n", "\"\"\"\n", "\n", "UNUSED_COLUMNS = [\"functional_weight\", \"education\", \"sex\"]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bSJjhQ8ZyDae" }, "source": [ "## Data preparation\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "6NIOrdciGKuk" }, "source": [ "### Process Numerical and Categorical columns\n", "\n", "The Census datasets contains both numbers and strings\n", "we need to convert string data into numbers to be able to train the model. \n", "\n", "**BigQuery ML supports handling categorical data:**\n", "\n", "For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.\n", "\n", "Learn more about [feature engineering](https://developers.google.com/machine-learning/crash-course/representation/feature-engineering) and [bias in data](https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QGcyDe8QvrJN" }, "source": [ "### Create BigQuery client\n", "\n", "Create a BigQuery client to bundle configuration needed for API requests." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "rfa4uapGv5DS" }, "outputs": [], "source": [ "client = bigquery.Client(location=\"US\", project=PROJECT_ID)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ehl6LCTSF38o" }, "source": [ "### Create BigQuery dataset\n", "\n", "We will create a dataset called `census_tutorial`,inside this dataset we will create our BQML model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "59aaPPqEF-XD" }, "outputs": [], "source": [ "# Dataset and table information\n", "dataset_name = \"census_tutorial\"\n", "\n", "# Create BigQuery dataset\n", "dataset = client.create_dataset(dataset_name)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "AS16WHowx9nx" }, "source": [ "### Extract Training and Evaluation dataframes:\n", "\n", "In this case we will split our data in 80/10/10 for training, validation and testing.\n", "\n", "https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning\n", "\n", "For machine learning, you want repeatable sampling of the data you have in BigQuery. To get the validation data: change the < 8 in the query above to = 8, and for testing data, change it to = 9. This way, you get 10% of samples in validation and 10% in testing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "rvHqiFRU59dK" }, "outputs": [], "source": [ "# This query will process 4.8 MB when run:\n", "query = \"\"\"\n", "SELECT\n", " age,\n", " workclass,\n", " functional_weight,\n", " education,\n", " education_num,\n", " marital_status,\n", " occupation,\n", " relationship,\n", " race,\n", " sex,\n", " capital_gain,\n", " capital_loss,\n", " hours_per_week,\n", " native_country,\n", " income_bracket\n", "FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "\"\"\"\n", "dataset = client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ucgJ1S0-6DAT" }, "source": [ "### Training dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "gBaAG_iQyB8y" }, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " age,\n", " workclass,\n", " functional_weight,\n", " education,\n", " education_num,\n", " marital_status,\n", " occupation,\n", " relationship,\n", " race,\n", " sex,\n", " capital_gain,\n", " capital_loss,\n", " hours_per_week,\n", " native_country,\n", " income_bracket\n", "FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8\n", "\"\"\"\n", "train_dataset = client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yk6vbgmE5Pe5" }, "source": [ "### Evaluation dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "8AG0cbK75LcC" }, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " age,\n", " workclass,\n", " functional_weight,\n", " education,\n", " education_num,\n", " marital_status,\n", " occupation,\n", " relationship,\n", " race,\n", " sex,\n", " capital_gain,\n", " capital_loss,\n", " 
hours_per_week,\n", " native_country,\n", " income_bracket\n", "FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8\n", "\"\"\"\n", "eval_dataset = client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_MZdAmcC2KCN" }, "source": [ "### Testing dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "fhwXgTtRDtRc" }, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " age,\n", " workclass,\n", " functional_weight,\n", " education,\n", " education_num,\n", " marital_status,\n", " occupation,\n", " relationship,\n", " race,\n", " sex,\n", " capital_gain,\n", " capital_loss,\n", " hours_per_week,\n", " native_country,\n", " income_bracket\n", "FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9\n", "\"\"\"\n", "test_dataset = client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1iqAXKj6YaBk" }, "source": [ "#### DataFrame Lengths\n", "\n", "List the lengths of the full, training, evaluation, and testing dataframes. This validates that each dataset contains data before we start processing it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "BrO4NJfLGfHE" }, "outputs": [], "source": [ "len(dataset), len(train_dataset), len(eval_dataset), len(test_dataset)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YCgU09ifNSHp" }, "source": [ "### Visualize data\n", "\n", "Pandas provides the `.corr` method, which finds the pairwise correlation of all columns in the dataframe. Any `na` values are automatically excluded, and non-numeric columns are ignored.
\n", "\n", "The term **correlation** refers to a mutual relationship or association between quantities.\n", "\n", "The closer ρ is to 1, the more an increase in one variable associates with an increase in the other. On the other hand, the closer ρ is to -1, the increase in one variable would result in a decrease in the other. Note that if X and Y are independent, then ρ is close to 0, but not vice versa! In other words, Pearson correlation can be small even if there is a strong relationship between two variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ImTjKHULNVC0" }, "outputs": [], "source": [ "hmap = dataset.corr(method=\"pearson\")\n", "plt.subplots(figsize=(12, 9))\n", "sns.heatmap(hmap, vmax=0.8, annot=True, cmap=\"BrBG\", square=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "noP57bOGjg77" }, "source": [ "When we look at the numerical features they **do not have** a strong correlation. \n", "The numerical attributes have a significant\n", "number of unique values: \n", "\n", " - `functional_weight`, has more than 28,000 unique values for a set of ~32000 values. \n", " \n", "This may indicate that this feature might not be a significant predictor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "xc5UccG4PJXU" }, "outputs": [], "source": [ "# Visualize interactions\n", "sns.pairplot(dataset, hue=\"income_bracket\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Z7N_iIhWEM5i" }, "source": [ "In this plot you can see the different samples and their values, this is important when you are doing feature engineering." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "rj1cCbklyl3W" }, "outputs": [], "source": [ "# Count of >50K & <=50K\n", "sns.countplot(dataset[\"income_bracket\"])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "76dV6o1a1qrz" }, "source": [ "There is an imbalance in the proportion of labels, with 24720 values for <=50K, and 7841 for => 50K.\n", "however, 24% of the total should be sufficient to determine class through patterns. In other cases techniques like, [boosting](https://arxiv.org/abs/1706.01531) can be applied." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DqGz3QGEupv5" }, "source": [ "## Feature Engineering\n", "\n", "### Numerical data analysis\n", "\n", "We will do a numeric data analysis to check on the different features and the correlation with income bracket.\n", "\n", "In the categorical variables, there are three attributes with unknown/missing values:\n", " - workclass (6%)\n", " - occupation (6%)\n", " - native_country (2%) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "UFcSHgSIQBYT" }, "outputs": [], "source": [ "# Checking Empty records (NULL) OR (?) 
and their percentage overall\n", "query = \"\"\"\n", "SELECT\n", " COUNTIF(workclass IS NULL \n", " OR LTRIM(workclass) LIKE '?') AS workclass,\n", " ROUND(COUNTIF(workclass IS NULL \n", " OR LTRIM(workclass) LIKE '?') / COUNT(workclass) * 100) \n", " AS workclass_percentage,\n", " COUNTIF(occupation IS NULL \n", " OR LTRIM(occupation) LIKE '?') AS occupation, \n", " ROUND(COUNTIF(occupation IS NULL \n", " OR LTRIM(occupation) LIKE '?') / COUNT(occupation) * 100) \n", " AS occupation_percentage,\n", " COUNTIF(native_country IS NULL \n", " OR LTRIM(native_country) LIKE '?') AS native_country,\n", " ROUND(COUNTIF(native_country IS NULL \n", " OR LTRIM(native_country) LIKE '?') / COUNT(native_country) * 100) \n", " AS native_country_percentage\n", "FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "RS_WL6e1rHMd" }, "outputs": [], "source": [ "# Checking for workclass values.\n", "query = \"\"\"\n", "SELECT\n", " workclass,\n", " COUNT(workclass) AS total_workclass\n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY workclass\n", "ORDER BY total_workclass DESC\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "vM6C3HDK2fbA" }, "outputs": [], "source": [ "# Checking for occupation values.\n", "query = \"\"\"\n", "SELECT\n", " occupation,\n", " COUNT(occupation) AS total_occupation\n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY occupation\n", "ORDER BY total_occupation DESC\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "1bJO0JVl41lv" }, "outputs": [], "source": [ "# Checking for native_country 
values.\n", "query = \"\"\"\n", "SELECT\n", " native_country,\n", " COUNT(native_country) AS total_native_country\n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY native_country\n", "ORDER BY total_native_country DESC\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zzGEXJPr3q4Q" }, "source": [ "After running the SQL queries above we have found:\n", "\n", " - `workclass` has almost 70% instances of Private, the Unknown (?) can be imputed with this value. \n", "\n", " - `native_country`, 90% of the instances are `United States` which can be used to impute for the Unknown (?) values. Same cannot be said about `occupation` column as the values are more distributed. \n", "\n", "- `occupation` the missing values account for 6% of the instances, it might be possible to remove these instances without replacement.\n", "\n", "Country:\n", " - Hong: Refers to Hong Kong\n", " - South: Probably refers to South Korea" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "-EIJYcrsr51g" }, "outputs": [], "source": [ "# Checking for capital_gain\n", "# Total records: 32561\n", "query = \"\"\"\n", "SELECT\n", " capital_gain,\n", " income_bracket,\n", " COUNT(capital_gain) AS capital_gain_records\n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY capital_gain, income_bracket \n", "ORDER BY capital_gain_records DESC\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "P5trzt1AsWSJ" }, "outputs": [], "source": [ "# Checking for capital_loss\n", "# Total records: 32561\n", "query = \"\"\"\n", "SELECT\n", " capital_loss,\n", " income_bracket,\n", " COUNT(capital_loss) AS capital_loss_records\n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY 
capital_loss, income_bracket\n", "ORDER BY capital_loss_records DESC\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1GP4rLoCrlwT" }, "source": [ "Let's take a look at the following numerical attributes:\n", "\n", " - `capital_gain` and `capital_loss` each have close to 100 unique values, although the majority of their instances have zero values. \n", "\n", " - `capital_gain` has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K. \n", "\n", " - `capital_loss` has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K. \n", "\n", "This implies that `capital_gain` or `capital_loss` will not make significant predictors either." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "78y91xr5vdJ_" }, "outputs": [], "source": [ "# Checking for education\n", "query = \"\"\"\n", "SELECT\n", " education,\n", " education_num \n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY education, education_num\n", "ORDER BY education_num\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RaVgvdRkNGFs" }, "source": [ "\n", "- `education` and `education_number` are indicators of the same attribute, and are fully correlated with direct mapping, it makes sense to remove one of them during feature selection." 
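] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "addOneToOneMd" }, "source": [ "This one-to-one claim is easy to check mechanically: each value of one column should map to exactly one value of the other. The cell below is an illustrative sketch on toy rows; the same check can be run on the `dataset` dataframe loaded earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "addOneToOneCode" }, "outputs": [], "source": [ "# Verify that two columns form a one-to-one mapping, as `education`\n", "# and `education_num` do. Toy rows keep the cell self-contained; run\n", "# the same check on the real `dataset` dataframe.\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({\n", "    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters'],\n", "    'education_num': [13, 9, 13, 14],\n", "})\n", "\n", "one_to_one = bool(\n", "    df.groupby('education')['education_num'].nunique().eq(1).all()\n", "    and df.groupby('education_num')['education'].nunique().eq(1).all()\n", ")\n", "print(one_to_one)  # True: one of the two columns is redundant"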
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "veeWUL1UzHUt" }, "outputs": [], "source": [ "# Explore Education Num vs Income\n", "g = sns.catplot(\n", " x=\"education_num\",\n", " y=\"income_bracket\",\n", " data=dataset,\n", " kind=\"bar\",\n", " height=6,\n", " palette=\"muted\",\n", ")\n", "g.despine(left=True)\n", "g = g.set_ylabels(\">50K probability\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "2huTuTkA8xIV" }, "outputs": [], "source": [ "# Checking for marital_status\n", "query = \"\"\"\n", "SELECT\n", " marital_status \n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY marital_status\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Ch6UO75qpIBG" }, "outputs": [], "source": [ "# Checking for relationship\n", "query = \"\"\"\n", "SELECT\n", " relationship \n", "FROM \n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", "GROUP BY relationship\n", "\"\"\"\n", "client.query(query).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "S70kmKILpNVE" }, "source": [ "As relationship and marital status are features that describe similar status, \n", "we may be able to drop `relationship` and keep `marital_status`\n", "and create a new feature with new values." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "sLoxvkYPSnIE" }, "source": [ "## Model training\n", "\n", "We will start Model training using the Public Dataset:\n", "\n", "The `CREATE MODEL` clause is used to create and train the model named `census_tutorial.census_model`. \n", "\n", "```\n", "`CREATE OR REPLACE MODEL` \n", "```\n", "\n", "Creates and trains a model and replaces an existing model with the same name in the specified dataset. 
Documentation [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create)\n", "\n", "```\n", "`OPTIONS(model_type='logistic_reg', input_label_cols=['income_bracket'])`\n", "```\n", "\n", "Clause indicates that you are creating a logistic regression model. This option creates a logistic regression model or a multiclass logistic regression model. \n", "For logistic regression models, the label column must contain only two distinct values. When you create a multiclass logistic regression model, specify training data that contains more than two unique labels.\n", "\n", "- **model_type:** logistic_reg creates a logistic regression model or a multiclass logistic regression model.\n", "\n", "- **auto_class_weights:** By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired. Class weights can be used to balance the class labels and can be used for logistic and multiclass logistic regressions. If set to true, the weights for each class are calculated in inverse proportion to the frequency of that class. To balance every class, use the following formula: TOTAL_INPUT_ROWS / (INPUT_ROWS_FOR_CLASS_N * NUMBER_OF_UNIQUE_CLASSES)\n", "\n", "- **data_split_method:** The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping. The default value is auto_split.\n", "\n", "- **input_label_cols:** The label column name(s) in the training data. input_label_cols accepts an array of strings, but only one array element is supported for linear_reg and logistic_reg models. If input_label_cols is unspecified, the column named \"label\" in the training data is used. 
If neither exists, the query fails.\n", "\n", "- **max_iterations:** The maximum number of training iterations (steps). The default value is 20.\n", "\n", "\n", "When you use a `CREATE MODEL` statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.\n", "\n", "When you use a `CREATE MODEL` statement, the label column cannot contain NULL values. If the label column contains NULL values, the query fails." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "A1tDH7_RDAIK" }, "source": [ "### Standardize data for Model Convergence\n", "\n", "1) For all numeric columns, BigQuery ML standardizes and centers the column at zero before passing it into training.\n", "\n", "2) In the SQL query we skip the following columns for the training data: \n", "\n", "`['functional_weight', 'education', 'sex', 'relationship']`\n", "\n", "3) Categorical features will be converted to numerical representations by BQML.\n", "\n", "4) The Unknown values in `workclass` and `native_country` are replaced with Private and United States respectively.\n", "The instances with Unknown values for `occupation` are removed.\n", "\n", "5) Drop `relationship` and use `marital_status` in a simplified form. (You can use a [Chi Square test](https://en.wikipedia.org/wiki/Chi-squared_test), which is commonly used for testing relationships between categorical variables, to compare `marital_status` and `relationship`.)\n", "\n", "6) Duplicates in the train set are removed. \n", "\n", "The result of the model creation query is an empty `DataFrame`; this is normal."
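] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "addCleanupSketchMd" }, "source": [ "The imputation, removal, and de-duplication steps above can also be sketched in pandas for local dataframes. The rows and values below are toy examples for illustration only; the actual clean-up happens in the SQL training query that follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "addCleanupSketchCode" }, "outputs": [], "source": [ "# Illustrative pandas version of the clean-up the training query does\n", "# in SQL: impute unknown workclass and native_country, drop rows with\n", "# an unknown occupation, then de-duplicate. Toy rows only.\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({\n", "    'workclass': [' ?', 'Private', 'Private'],\n", "    'native_country': ['United-States', ' ?', 'United-States'],\n", "    'occupation': ['Sales', ' ?', 'Sales'],\n", "})\n", "\n", "for col, fill in [('workclass', 'Private'),\n", "                  ('native_country', 'United-States')]:\n", "    df[col] = df[col].str.strip().replace('?', fill)\n", "\n", "df = df[df['occupation'].str.strip() != '?']  # drop unknown occupations\n", "df = df.drop_duplicates()  # remove duplicate training rows\n", "print(len(df))  # 1: one row dropped, one duplicate removed"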
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "tO15ZJdaOF9t" }, "outputs": [], "source": [ "# Train a BQML model\n", "train_query = \"\"\"\n", "CREATE OR REPLACE MODEL `census_tutorial.census_model`\n", " OPTIONS (\n", " model_type='logistic_reg',\n", " auto_class_weights=true,\n", " data_split_method='no_split',\n", " input_label_cols=['income_bracket'],\n", " max_iterations=15) AS\n", " SELECT\n", " age,\n", " CASE \n", " WHEN workclass IS NULL THEN 'Private' \n", " WHEN LTRIM(workclass) LIKE '?' THEN 'Private'\n", " ELSE workclass\n", " END AS workclass,\n", " CASE \n", " WHEN native_country IS NULL THEN 'United States' \n", " WHEN LTRIM(native_country) LIKE '?' THEN 'United States'\n", " ELSE native_country\n", " END AS native_country, \n", " CASE \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Never-married',\n", " 'Divorced',\n", " 'Separated',\n", " 'Widowed'\n", " ) THEN 'Single' \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Married-civ-spouse',\n", " 'Married-spouse-absent',\n", " 'Married-AF-spouse'\n", " ) THEN 'Married' \n", " ELSE NULL \n", " END AS marital_status,\n", " education_num,\n", " occupation,\n", " race, \n", " hours_per_week, \n", " income_bracket\n", " FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", " WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8\n", " AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'\n", " GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9\n", "\"\"\"\n", "client.query(train_query)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zBSXiOmoSjb0" }, "source": [ "### Model information\n", "\n", "A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. 
This process is called empirical risk minimization.\n", "\n", "Loss is the penalty for a bad prediction, a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "WCkGqyRMRq2Q" }, "outputs": [], "source": [ "training_info = \"\"\"\n", "SELECT\n", " training_run,\n", " iteration,\n", " loss,\n", " eval_loss,\n", " duration_ms,\n", " learning_rate\n", "FROM\n", " ML.TRAINING_INFO(MODEL `census_tutorial.census_model`)\n", "ORDER BY iteration ASC\n", "\"\"\"\n", "client.query(training_info).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pkI9JUkuU15e" }, "source": [ "## Model evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "XaWeKRrRU8si" }, "outputs": [], "source": [ "# Perform model evaluation\n", "query_evaluate = \"\"\"\n", "SELECT \n", " precision,\n", " recall,\n", " accuracy,\n", " f1_score,\n", " log_loss,\n", " roc_auc\n", "FROM ML.EVALUATE (MODEL `census_tutorial.census_model`, \n", " (\n", " SELECT\n", " age,\n", " CASE \n", " WHEN workclass IS NULL THEN 'Private' \n", " WHEN LTRIM(workclass) LIKE '?' THEN 'Private'\n", " ELSE workclass\n", " END AS workclass,\n", " CASE \n", " WHEN native_country IS NULL THEN 'United States' \n", " WHEN LTRIM(native_country) LIKE '?' 
THEN 'United States'\n", " ELSE native_country\n", " END AS native_country, \n", " CASE \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Never-married',\n", " 'Divorced',\n", " 'Separated',\n", " 'Widowed'\n", " ) THEN 'Single' \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Married-civ-spouse',\n", " 'Married-spouse-absent',\n", " 'Married-AF-spouse'\n", " ) THEN 'Married' \n", " ELSE NULL \n", " END AS marital_status,\n", " education_num,\n", " occupation,\n", " race, \n", " hours_per_week, \n", " income_bracket\n", " FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", " WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8\n", " AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'\n", " GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9\n", " ))\n", "\"\"\"\n", "evaluation_job = client.query(query_evaluate).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-pzyB2ByXJj-" }, "source": [ "### Model results\n", "\n", "By examining the area under the ROC curve (roc_auc) printed during evaluation, you should see that your model ultimately learned to predict a person's income bracket with a roc_auc of around 0.78. Can you improve it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "8JQ8eAq_XNOg" }, "outputs": [], "source": [ "evaluation_job" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "F8Tl0zmdJq7U" }, "source": [ "### ROC \n", "\n", "The output of the ML.ROC_CURVE [function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-roc) includes multiple rows with metrics for different threshold values for the model. 
The metrics include:\n", "\n", "- threshold\n", "- recall\n", "- false_positive_rate\n", "- true_positives\n", "- false_positives\n", "- true_negatives\n", "- false_negatives\n", "\n", "More information about these metrics is available [here](https://en.wikipedia.org/wiki/Precision_and_recall).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Gv6PQ-9HJu9d" }, "outputs": [], "source": [ "# Compute the ROC curve\n", "query_roc_curve = \"\"\"\n", "SELECT\n", " threshold,\n", " recall,\n", " false_positive_rate,\n", " true_positives,\n", " false_positives,\n", " true_negatives,\n", " false_negatives\n", "FROM\n", " ML.ROC_CURVE(MODEL `census_tutorial.census_model`,\n", " (\n", " SELECT\n", " age,\n", " CASE \n", " WHEN workclass IS NULL THEN 'Private' \n", " WHEN LTRIM(workclass) LIKE '?' THEN 'Private'\n", " ELSE workclass\n", " END AS workclass,\n", " CASE \n", " WHEN native_country IS NULL THEN 'United States' \n", " WHEN LTRIM(native_country) LIKE '?' 
THEN 'United States'\n", " ELSE native_country\n", " END AS native_country, \n", " CASE \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Never-married',\n", " 'Divorced',\n", " 'Separated',\n", " 'Widowed'\n", " ) THEN 'Single' \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Married-civ-spouse',\n", " 'Married-spouse-absent',\n", " 'Married-AF-spouse'\n", " ) THEN 'Married' \n", " ELSE NULL \n", " END AS marital_status,\n", " education_num,\n", " occupation,\n", " race, \n", " hours_per_week, \n", " income_bracket\n", " FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", " WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8\n", " AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'\n", " GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9))\n", "\"\"\"\n", "roc_curve_job = client.query(query_roc_curve).to_dataframe()\n", "roc_curve_job" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fW7vTPm2pd2l" }, "source": [ "### Visualize the ROC curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "lfuLSeSXucI4" }, "outputs": [], "source": [ "# Visualize the ROC curve.\n", "plt.title(\"BigQuery ML model ROC curve\")\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.02])\n", "plt.xlabel(\"False Positive Rate\")\n", "plt.ylabel(\"True Positive Rate\")\n", "fpr = roc_curve_job[\"false_positive_rate\"]\n", "tpr = roc_curve_job[\"recall\"]\n", "plt.plot(fpr, tpr, color=\"darkorange\")\n", "plt.plot([0, 1], [0, 1], color=\"navy\", linestyle=\"--\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-RxaXEzfObQU" }, "source": [ "## Model predictions\n", "\n", "#### Generate predictions\n", "\n", "Now you will use your model to predict outcomes. The following query uses the ML.PREDICT function. 
The query returns these columns:\n", "\n", " - income_bracket\n", " - predicted_income_bracket\n", " - predicted_income_bracket_probs\n", " - All other columns specified in the query.\n", "\n", "**Note:** Normally, the label is empty at prediction time; in this example\n", " we keep it so that we can compare the model's predictions against the expected labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "7Ex6PXtLOGl2" }, "outputs": [], "source": [ "query_prediction = \"\"\"\n", "SELECT\n", " income_bracket,\n", " predicted_income_bracket,\n", " predicted_income_bracket_probs\n", "FROM\n", " ML.PREDICT(MODEL `census_tutorial.census_model`,\n", " (\n", " SELECT\n", " age,\n", " CASE \n", " WHEN workclass IS NULL THEN 'Private' \n", " WHEN LTRIM(workclass) LIKE '?' THEN 'Private'\n", " ELSE workclass\n", " END AS workclass,\n", " CASE \n", " WHEN native_country IS NULL THEN 'United States' \n", " WHEN LTRIM(native_country) LIKE '?' THEN 'United States'\n", " ELSE native_country\n", " END AS native_country, \n", " CASE \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Never-married',\n", " 'Divorced',\n", " 'Separated',\n", " 'Widowed'\n", " ) THEN 'Single' \n", " WHEN LTRIM(marital_status) IN \n", " (\n", " 'Married-civ-spouse',\n", " 'Married-spouse-absent',\n", " 'Married-AF-spouse'\n", " ) THEN 'Married' \n", " ELSE NULL \n", " END AS marital_status,\n", " education_num,\n", " occupation,\n", " race, \n", " hours_per_week, \n", " income_bracket\n", " FROM\n", " `bigquery-public-data.ml_datasets.census_adult_income`\n", " WHERE\n", " MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9\n", " AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'\n", " GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9\n", " ))\n", "\"\"\"\n", "predictions = client.query(query_prediction).to_dataframe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "AYcvzsjtydds" }, "source": [ "We will see a few 
predictions and the expected results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "GffyYYZ3hVBL" }, "outputs": [], "source": [ "predictions[[\"income_bracket\", \"predicted_income_bracket\"]].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "yqsessdug7-C" }, "outputs": [], "source": [ "predictions[\"predicted_income_bracket_probs\"].head()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cpKsvKAipZvu" }, "source": [ "In this case, the first prediction probability is ~0.53, which corresponds to `>50K`. \n", "As you can see, it is very easy to load data into BigQuery and create a model for training and serving using SQL only. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "MedeDGPFgXFq" }, "source": [ "**Count the number of correct predictions:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Wn5-E2-qeb6I" }, "outputs": [], "source": [ "_count = (\n", " predictions[\"income_bracket\"].str.strip().str.lower()\n", " == predictions[\"predicted_income_bracket\"].str.strip().str.lower()\n", ")\n", "# Show the fraction of correct vs. incorrect predictions:\n", "_count.value_counts(normalize=True)\n", "# Note: Set normalize=False to see raw counts." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "olSZbtCMjX6h" }, "source": [ "## Conclusion\n", "\n", "Creating machine learning models has never been easier: with BigQuery ML, you can use the power of the cloud and the SQL language to build solutions for a wide range of business challenges." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "K0UXLWaBJnrY" }, "source": [ "## Cleaning up\n", "To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:\n", "\n", "- Delete the project you created, or\n", "- Keep the project and delete the dataset.\n", "- Make sure you check the Pricing section when using this tutorial.\n", "\n", "## Support\n", "https://cloud.google.com/bigquery/support\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "Getting started: Feature Engineering, Training and Prediction with BQML", "provenance": [], "toc_visible": true, "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }
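As a sanity check on the roc_auc value that ML.EVALUATE reports, the area under the ROC curve can be recomputed directly from the (false_positive_rate, recall) pairs that ML.ROC_CURVE returns, by trapezoidal integration. A minimal sketch, using a few hypothetical ROC points rather than real query output (the numbers below are illustrative assumptions, not values from the census model):

```python
# Hypothetical (false_positive_rate, recall) pairs, sorted by FPR,
# standing in for rows returned by ML.ROC_CURVE.
points = [
    (0.0, 0.0),
    (0.1, 0.5),
    (0.3, 0.8),
    (1.0, 1.0),
]

# Trapezoidal rule: sum the average TPR of each adjacent pair of points,
# weighted by the FPR step between them.
auc = sum(
    (tpr0 + tpr1) / 2.0 * (fpr1 - fpr0)
    for (fpr0, tpr0), (fpr1, tpr1) in zip(points, points[1:])
)
print(round(auc, 3))  # → 0.785
```

On real output, the same computation applied to the roc_curve_job DataFrame (sorted by false_positive_rate) should closely match the roc_auc column from ML.EVALUATE.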